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Abstract 



Signal estimation from incomplete observations improves as more signal structure can be exploited in the 
inference process. Classic algorithms (e.g., Kalman filtering) have exploited strong dynamic structure for time- 
varying signals while modern work has often focused on exploiting low-dimensional signal structure (e.g., sparsity 
in a basis) for static signals. Few algorithms attempt to merge both static and dynamic structure to improve estimation 
for time-varying sparse signals (e.g., video). In this work we present a re-weighted l\ dynamic filtering scheme for 
causal signal estimation that utilizes both sparsity assumptions and dynamic structure. Our algorithm leverages work 
on hierarchical Laplacian scale mixture models to create a dynamic probabilistic model. The resulting algorithm 
incorporates both dynamic and sparsity priors in the estimation procedure in a robust and efficient algorithm. We 
demonstrate the results in simulation using both synthetic and natural data. 



Collecting data can be an expensive process, especially for high-dimensional time-varying signals. In many 
applications we wish to significantly reduce the burden on the front-end measurement process while still accurately 
estimating a signal (or state) of interest. One tenet of modern signal processing is that decreasing the number of 
observations requires exploiting more signal structure in the inference process to get quality estimates. Two sources 
of signal structure that have been especially successful in many applications are dynamics (e.g., temporal regularity) 
and sparsity (e.g., compressibility in a basis at a given time). 

Using the language of mathematics, we focus on the problem of estimating the time-varying state x n G M. N , that 
evolves according to a Markov process 
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I. Introduction 
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where f n (•) : R. N — > R N is the operator (assumed known) describing the system evolution with some error u n 
G R N (called the innovation). The state is assumed to be hidden and is estimated from noisy linear measurements 

y G R M : 

Un — G n x n + e ra , (2) 

where G n G R MxAr is the (known but potentially time- varying) measurement matrix with some error e n G R M . 
When the number of unique measurements is identical to the size of the state (M = N and G n is full rank), the 
measurement matrix is invertible and state estimation at each time step is straightforward. In the more interesting 
(and realistic) case the number of measurements at each time step is significantly less than the number state 
parameters (M <C N) and G n is singular. Under these conditions, quality estimates require regularization using a 
model of the signal structure. 

Knowledge of the dynamic operator f n (•) is obviously powerful and has been used to regularize classic estimation 
techniques for decades. In this work, we focus on causal estimation (known as filtering) where the current state is 
estimated from past and present measurements. While a number of schemes have been established to use system 
dynamics as a prior distribution for state estimation, the predominant set of tools stem from the seminal work by 
Kalman in the 1960's (TJ. The classic Kalman filter is an optimal and causal estimator for the special case of 
linear dynamics (/ (•)) and Gaussian statistics (measurement errors, innovations, and the prior on the initial state 
Xo). The Kalman filter uses a one-step update that combines the current measurement, the past state estimate, and 
the covariance matrices of the random quantities (the measurement noise, the innovations, and the previous state 
estimate). The astonishing feature of this estimator is that the estimate is globally optimal (i.e., optimal conditioned 
on all past measurements) because the propagating covariance estimates are sufficient to capture the entire model 
statistics at the current state under the linear and Gaussian assumptions. If the assumptions of the model are satisfied, 
the state estimates will continue to improve at each iteration until the fundamental noise floor is reached (regardless 
of the number of measurements per iteration). 

The past decade of work in signal and image processing has demonstrated the power of leveraging the non- 
Gaussian statistics present in many natural signal families. In particular, outside of the dynamic estimation regime, 
signal priors based on the notion of sparsity have lead to state-of-the-art performance in many linear inverse problems 
(e.g., undersampling, inpainting, denoising, compressed sensing, etc. [2 ]) across several different application domains 
(e.g. natural images Q, audio (3j, hyperspectral imagery [4], etc.). In this paradigm, a static signal x is assumed 
hidden and is only observed through a linear measurement proces^j] y = &x + e. In contrast to the classical 
technique of using a least-squares estimate (e.g., corresponding to a Gaussian prior on the state), state-of-the-art 

'For static problems we will use <!> as the observation matrix and for dynamic problems we will use G in order to better differentiate the 
two cases. 
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approaches tend to perform a penalized least-square estimate, stated generally as 



x = argmm 

X 



\y - §x\\l + 7C (ac) 



(3) 



where 7 trades off between mean-squared error and a cost function C(-) that prefers signals that are sparse in a given 
basis W. The sparsity model has received significant attention in the recent results on compressed sensing (CS) J5J, 
j6j, where this low-dimensional structure has been particularly successful as a strong regularizer to overcome a 
lack of measurements (discussed in detail in Section |II-A| ). 

While estimation approaches separately exploiting signal structure from dynamics and sparsity have been enor- 
mously successful, there is comparatively little work incorporating both types of information jointly. Sparsity priors 
are often more realistic than Gaussian assumptions for natural signals and can be especially effective for high- 
dimensional signals where Kalman approaches are computationally inefficient (due to storing and inverting large 
matrices). On the other hand, temporal regularity can be the strongest source of information about a signal, and 
Kalman approaches exploit this successfully using straightforward computations. Given the success of each of these 
models individually and the different types of signal structure they capture, there are many applications (e.g., CS 
for video, dynamic MRI, etc.) where one can reasonably expect significant potential gains from jointly leveraging 
both types of signal structure. 

The previous work integrating sparsity models into dynamic estimation has not yet followed the spirit of the 
successful Kalman filter by propagating higher-order statistics from one time step to the next in a way that is well- 
matched to the assumptions of the problem]^] In this work we propose a causal estimation method for dynamic sparse 
signals that propagates higher-order statistics to integrate uncertainty naturally into the sparse inference process. 
Specifically, we make use of hierarchical probabilistic models [8] that allow for modulations in the sparse estimation 
problem to account for information from previous time steps. This algorithm is based on the re-weighted l\ (RWL1) 
optimization program |9j, [ 10 1 that has been successful in static sparse signal estimation and is more computationally 



efficient than many alternative approaches (e.g., it does not require storage or inversion of covariance matrices and 

the fundamental optimizations can leverage recent advances in efficient l\ optimization). Additionally, this approach 

is particularly robust to model mismatch (i.e., shot-noise in the innovations). We demonstrate the properties of this 

approach on both simulated data as well as natural video sequences. 

2 As is pointed out in (7), propagating a covariance matrix is no longer necessarily the correct or optimal method once the assumptions 
of the Kalman filter are not met. Instead, such a matrix is no more special than a simple parameter matrix, which may or may not assist in 
the signal estimation. 
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II. Background and Related Work 

A. Sparse Signal Estimation 

As mentioned above, one of the most successful signal models in modern signal processing has been the 
sparsity model, characterizing one type of low-dimensional structure in high-dimensional signals. Specifically, in 
this synthesis model the signal can be written 

x = Wz, 

where the coefficient vector z € M. N ' is mostly composed of zeros and the dictionary W may be orthonormal or 
overcomplete. In the language of Bayesian inference, this model assumes that the prior distribution on z has high 
kurtosis to encourage coefficients to be zero (or close to zero). The most commonly used prior is for the coefficients 
to be independent and identically distributed (IID) with the Laplacian distribution 

p(z) = l e -7lWl 1; 

where 7 _1 is the variance and indicates the l\ normj^] 

When the IID Laplacian prior is used in addition to a linear measurement model and white Gaussian noise 
distributions, the MAP estimate reduces to solving 



arg mm 

z 



1 ,,2 

^\\y-<f>Wz\\ 2 + ~/\\z\\ 1 
x = Wz, 



(4) 



where a\ is the variance of the measurement noise. This optimization program is referred to as Basis Pursuit 
De-Noising (BPDN) and has been used extensively in the signal processing community over the past decade for 
signal estimation. BPDN has been a particularly popular approach because of its strong performance guarantees 



(e.g., conditions for equivalence to a combinatorial minimization of the number of non-zero coefficients [11 1) 



and the emergence of many specialized optimization algorithms (e.g. interior point methods |12|, |13|], homotopy 



methods [14], gradient projection methods [15], and analog systems [16|, |17|). Note also that BPDN jointly 
estimates the signal support and the coefficient values. If a support set were estimated separately, estimating 
coefficient values would be trivial (i.e., least squares) but errors in the support set would potentially cause significant 
unrecoverable errors in the total estimation. 

In Q, the variance parameter 7 for the coefficient distribution is assumed known and identical for each coefficient. 
The RWL1 algorithm [|9j was introduced to broaden the scope of this model, and was later described [8] in the 
context of a hierarchical probabilistic model. This probabilistic model, known as the Laplacian scale mixture (LSM), 
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*lli = £il*[*H- 



introduces a second layer of random variables A G M. N representing the variance of the first level Laplacian 
coefficients. In short, the LSM allows variable SNR on each coefficient, and infers the coefficient variances instead 
of assuming they are known and fixed. The LSM uses the Gamma distribution (due to it being a conjugate prior 
for the Laplacian) as the hyperprior on 



where V {•) is the Gamma function and a and 9 are parameters for the hyper-priors with i£[A[i]] = a6 and 
Vor[A[i]] = a9 2 . 

Assuming a = 1 and the Laplacian conditional distribution on the coefficients p (z[i]|A[i]) = ^ie~ A Wl z WI, the 
MAP estimate for the full model can be written as 

{z,A} = argmin||y- + |A[i]z[i]| - log | ^ A[i] J +^ 1 J]A[i] (5) 

x = Wz. 

Though this optimization program is difficult to solve directly due to the interaction between A and z, setting 
up the expectation-maximization (EM) for this estimation |8| results in the RWL1 algorithm where the following 
equations are iterated until a convergence criteria is met: 



argmin||y - $Wz||2 + A ^ | A*[i]z[i] | (6) 



A<+1 W = r~Jj^ > (7) 

|z*[fc]| + r? 

where t indicates the iteration number and Ao, (3 and r\ are constants resulting from the hyperprior parameters 
(which can be set due to model information or to improve algorithmic stability). After the algorithm is run to 
completion, the desired signal is reconstructed with x = Wz. Several papers have followed up on this RWL1 



approach, including description of it as a compromise between i\ and £o minimization |10|, [18|, and analysis of 
performance guarantees p9| , |20|. 

Similar to BPDN, RWL1 is still performing joint estimation of the support set and the coefficient values. However, 
in contrast to BPDN, RWL1 uses a multi-stage estimation procedure that uses information from a previous iteration 
to try and focus the estimate even more accurately. Note that coefficients with little evidence in the initial iteration 
will have their corresponding variance decreased so that a future iteration is more likely to try and cluster that 
coefficient around zero, and coefficients with strong evidence in initial iterations will have their corresponding 
variance increased to make future iterations more likely to use that coefficient in the support set. By using second 
order statistics to try and focus the support set estimation, RWL1 gets some benefits from seeking a sparser solution 
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but is still flexible enough to be more robust than a rigid support set estimation. 

B. Dynamic Signal Estimation 

In dynamic filtering, knowledge of the system dynamics is used to set a prior probability distribution that 
regularizes the estimation to counteract insufficient measurements]^] In the classic Kalman filter setup, this process 
simplifies dramatically due to the assumption of Gaussian statistics, a linear measurement function and linear system 
dynamics (i.e. f k (a:) = F k x). In this case, assuming v k ~ M (0, Q), the conditional distribution (serving as a prior) 
on the state at time k in terms of the previous state is p(x k \xk_i) ~ M (Fjf.Xf.-i, Q)- Similarly, assuming ~ 
M (0, R), the likelihood on the measurements at time k in terms of the current state is p(yk\xk) ~ M (GkXk, R)- 
Using the Markov property of the state distributions, the joint MAP estimate over all time is 



{xk}k=o...n = arg min 

{x k }k=o.. 



where the kerneled norm is defined as ||a;| 



^WVk ~ G k X k \\ 2 R ^ + 22 \\ X k ~ ^fc^fc-ll 



2 

Q,2 



.fc=0 



k=l 



(8) 



fi,2 



x T R l x. Note that the first term in this optimization is a 



likelihood term encouraging consistency with the measurements, and the second term acts as a regularization prior 
on each state encouraging consistency with the previous estimate and the assumed dynamics. Unfortunately, this 
joint state estimation often requires observation over many time steps to adequately regulate the estimate, resulting 
in large optimization problems and long buffering periods. 

To address the high computational cost and buffering time of joint state estimation, the Kalman filter 1 1 ] takes an 
elegant causal filtering approach that estimates only the current state using past estimates and low-rank updates]^] 
Using the basic properties of Gaussian random vectors and linear operators, we can write the marginal distribution 
on the current state as p(x n ) ~ N (x n -i, F n P n -iF^ + Q), where P n is the covariance matrix of the previous 
state estimate. This distribution serves as a prior for the current state, capturing information about all of the previous 
observations through the previous state estimate and its covariance matrix (which captures all of the uncertainty 
in that estimate due to the Gaussian assumptions). Thus the Kalman filter only requires the current measurement 
along with this single prior to infer the current state via the MAP estimate 



X; 



arg mm 



G n x 



n\\R.2 



+ \ \x 7 



F nX 



n^n— 1 



F„P„_ 1 F n T +Q,2 



(9) 



which has an analytic solution (known as the Kalman update equations |TJ). This causal estimation procedure is 

depicted in Figure [jja). Intuitively, the estimator propagates the previous state estimate (and its covariance matrix) 

through the known system dynamics to get a prediction (i.e., a prior) of the current state. This prior is combined 

4 While there are many different ways to derive the Kalman filter, we limit the discussion here to a Bayesian inference viewpoint. 
5 Though we focus on predicting the current state using Kalman filtering, the ideas are easily extendible to estimating past and future 
states, known as Kalman smoothing and Kalman prediction. 
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(a) Kalman Filter 



(b) Unscented Kalman Filter 




Sparsity Prior 

(C) BPDN-DF 
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(d) RWL1-DF 



Fig. 1. Information propagation in dynamic filtering algorithms, (a) The standard Kalman filtering equations propagate the mean and 
covariance (the two quantities needed to propagate Gaussian distributions forward) through the system dynamics to generate a prior distribution 
for the next state estimate. This prior is used in conjunction with the new measurements through the likelihood function (due to the distribution 
of the measurement noise) to estimate the new state and its distribution. The extended Kalman filter is similar with a linear approximation to 
non-linear dynamics, (b) The unscented Kalman filter (and particle filters in general) estimate the prior distribution via a sampling process 
that empirically approximates the distribution's moments, (c) Adding the sparsity prior in directly to the Kalman filter optimization (as in 
BPDN-DF) results in a regularization norm which does not promote sparsity as well as desired. Left: Previous state estimates can still be 
propagated through the dynamic model to generate a prior that can be combined with an additional prior to encourage sparsity (both in red). 
Right: Combining the two priors from the left diagrams shows a total signal prior (i.e., due to the prediction and the sparsity penalty) that 
is curved outward more like an £2 ball. Due to the convex shape, the total prior is less effective at promoting sparsity than the l\ ball, (d) 
In contrast, RWL1-DF uses the previous estimate to set the parameters of a prior that has the diamond-like shape known to promote sparse 
solutions. 



with the likelihood resulting from the current measurement to generate a posterior distribution that can be used 
to estimate the current state. Though the causal estimator is computationally much simpler than joint estimation 
and does not suffer from buffering delay, it still calculates the exact same solution for x n as the joint optimization 
program in ([8]). 

Unfortunately, the analytic simplicity and optimality guarantees of the Kalman filter are highly dependent on the 
linear and Gaussian model assumptions. Any deviation from these assumptions makes it impossible to completely 
characterize the past measurement and estimator uncertainty by simply propagating covariance matrices forward 
from one estimate to the next. Though not optimal estimators, many heuristic approaches have been introduced 
that follow the spirit of the Kalman filter while incorporating nonlinear system dynamics or general non-Gaussian 



structure. For example, the Extended Kalman Filter |21 1 incorporates (weakly) nonlinear system dynamics by using a 
linear approximation to f n (x) in the standard Kalman filter. An alternate approach for highly non-linear functions 
or non-Gaussian statistics is particle filtering, which uses discrete points (particles) to approximate the relevant 
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distributions. Particles are drawn from a distribution around the previous state estimate and can be propagated 
through a nonlinear dynamics function, creating estimates of a distribution for the prediction of the current state. 
While general particle filtering uses Monte Carlo techniques to draw particles from distributions, the Unscented 
Kalman filter [22 ] (depicted in Figure [TJb)) can be viewed as an example of this technique with deterministic 
methods for drawing the particles. Though particle filtering approaches do seek the true prior distribution, these 
methods can become intractable in high-dimensional state spaces due to the large number of samples needed to 
characterize the distributions. While each of these general approaches (and many more) have had some success, no 
classic techniques explicitly utilize information about sparsity structure in the model that has been so powerful in 
modern signal processing. In particular, dynamic estimation algorithms often have difficulty dealing with shot-noise 
in the innovations, where model errors are often small but occasionally very large (i.e., sparse innovations). 



C. Sparsity in Dynamic Signals 

With the potential to improve signal estimation in many important application, recent work has begun the 
important task of incorporating sparsity priors into dynamic signal estimation. Some of these methods take a 



batch processing approach that uses all measurements to jointly estimate the states over all time (e.g. |23|-|27|). 
While non-causal approaches are interesting for some applications, we focus here on causal (streaming) filters that 
estimate the current state sequentially as new measurements become available. 

One branch of recent work focuses on using temporal information to generate state estimations via efficient 
updates, mimicking the Kalman filter's ability to obtain the new estimate using computationally simple updates on 
the previous estimate. For example, efficient updates to programs like BPDN (e.g., when a new measurement is 



taken of a static signal) have been explored using homotopy methods [28], [29]. These methods, while utilizing the 
dynamic information implicitly to simplify the estimation procedure, do not incorporate explicit temporal structure 
to improve the quality of the estimate. 

Another branch of previous work focuses on using one-step updates to emulate the Kalman filter's use of 
temporally local estimates. One approach in this vein attempts to leverage the Kalman estimator directly by using 



a pseudo-norm in the update equations to encourage sparser solutions [30] and then enforcing an i\ constraint on 



the state after the Kalman update. Another approach [31], |32| takes a two-step approach, first performing support 
estimation using l\ cost functions and then running the traditional Kalman equations on a restricted support set. 
Both of these approaches essentially modify the Kalman filter equations directly (including the propagation of 
covariance matrices), despite the statistics of the problem being highly non-Gaussian. However, many applications 
which benefit most from sparsity models involve large states, making storage and inversion of full covariance 
matrices prohibitive. Additionally, while the work in pT| , |32| assume sparse innovations (with the condition that 
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the innovations sparsity is much less than even the state sparsity), the robustness to mis-matched innovations has 
not been explored. 

More recent approaches (e.g. |33|-|35|) start from the optimization framework in ([8]) rather than the specific 



Kalman update equations, utilizing a restricted dynamic model for the coefficients' temporal evolution. In these 
approaches additional norms are appended to the BPDN optimization to enforce the state prediction. Such norm 
restrictions make explicit assumptions on the innovations statistics (see the discussion in [35]) and are thus less 
robust to model mismatch. Additional models have considered more direct coefficient transition modeling via 



Markov models [36 1, |37| either by utilizing message passing to propagate support information through time [37| 



or by using the previous estimate to influence coefficient selection through modified orthogonal matching pursuit 



(OMP) [36]. These approaches either incorporate restrictive models designed for specific applications [33], |34|, 



[37 1, have limited robustness due to the fact that they strictly enforce a support set estimate [35], (36| or retain the 



covariance propagations from the Kalman setting [34]. None of these approaches strike a balance between utilizing 
dynamic models, adapting to improve robustness to model error, utilizing higher-order statistics native to sparse 
signal estimation, and retaining the computational efficiency found in either Kalman filtering and optimized BPDN 
solvers. 



Of the previous approaches to time-varying sparse signal estimation, the approach described in [35] and [34 J 



(similar to |24|, [33 1) is most similar in philosophy to the approach described in this paper. These algorithms add 
norms that penalize differences between the estimated signal and a prediction based on the previous estimate. The 
BPDN dynamic filtering (BPDN-DF) in (351 (and similar to one approach in p4| ) modifies the BPDN estimator 
to include a term that enforces consistency with the model prediction from the previous state estimate 

x n = argmin||?/ n - $VFz||2 + 7 \\ z \\ + k\\Wz - f (x n -i)\\ q , (10) 

where 7 and k are tradeoff parameters and q represents the norm used to penalize the prediction error. While BPDN- 
DF shows performance improvements over standard Kalman filters and independent BPDN estimation at each time 
step (i.e., treating each new state as a static signal by running standard BPDN without temporal information), 
this estimator essentially relies on passing only first-order information forward at each time in a manner that is 
inconsistent with the goal of sparse signal estimation. Specifically, Figure [TJc) depicts how the combined prior 
(the dynamics and the sparsity assumptions together, with or without a covariance matrix) takes a more "rounded" 
shape. This bulging of the combined prior starts to resemble the £2 ball, and is in conflict with the geometrical 
structures known in the community to be necessary to favor sparse solutions (e.g., the i\ ball) JfjJ, |38|. The 



fundamental problem with BPDN-DF is that the information propagated across each time step is not well-matched 
and integrated into the sparse signal model. Our approach attempts to overcome the shortcomings of current methods 
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Fig. 2. The RWL1-DF algorithm inserts the dynamic information at the second layer of the LSM model for each time step. The graphical 
model depicts the model dependencies, where prior state estimates are used to set the hyperpriors for the second level variables controlling 
the variances (i.e., SNRs) of the state estimates at the next time step. 

by preserving the sparsity inducing norm geometry while allowing for arbitrary dynamic models, second order 
information propagation suitable for high dimensional data, robustness to dynamic model error and computational 
ease utilizing the vast literature on efficient BPDN implementations. 

III. Re-Weighted l\ Dynamic Filtering 

In this section we describe the proposed RWL1-DF algorithm for the general dynamics model in ([T]) with the 
linear measurement process in Q. The main idea of the proposed method is to use the rich signal description 
available in a hierarchical sparsity model to propagate second-order uncertainty in dynamic signal estimation (akin 
to the covariance matrices in Kalman filtering). This approach leverages the LSM model and its connections to 
RWL1 optimization to build a computationally efficient causal estimator. The main technical innovation we propose 
is to use the hyper-priors in the LSM model to inject dynamic information into the sparsity inducing priors, using the 
variable coefficient SNRs to encourage or discourage (but not force) coefficient activity based on predictions from 
the previous state estimate. The resulting estimation procedure is depicted in Figure [TJd) and the graphical model is 
depicted in Figure [2j In contrast to BPDN-DF (e.g. Figure [jjc)), the proposed approach results in an optimization 
program that naturally uses second-order statistics while retaining the geometric characteristics most effective for 
encouraging sparse solutions. By modulating the variance of the coefficients being estimated (rather than restricting 
the support set directly), the proposed approach also demonstrates improved estimation and robustness for sparse 
states and innovations (i.e., model errors). 

Specifically, similar to the original LSM, the proposed model describes the conditional distribution on the sparse 
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coefficients at time k as a zero-mean Laplacian with different variances: 

^W|A fc ~^le-^WI^WI. (11) 



We also set the scale variables controlling the coefficient variance to be Gamma distributed: 

\, M „ A fc _M e -A fc [i]/flfc [i] (12 x 

and allow each of these variables to have different means, (a6 k [i] = E[X k [i\\) by modifying the value of k [i]. 

The expected value of each scale variable is set based on dynamic information from the previous state estimate. 
In particular, if the model prediction of z k [i] based on the previous state is large (or small), the variance of that 
coefficient at the current estimate is made large (or small) by making X k [i] small (or large). Large variances allow 
the model flexibility to choose from a wide-range of non-zero values for coefficients that are likely to be active 
and small variances (with a mean of zero) encourage the model to drive coefficients to zero if they are likely to be 
inactive. However, by "encouraging" the model through the use of second order statistics (instead of forcing the 
model to use a particular subset of coefficients through a separate estimation process) the model remains robust 
and flexible. In this work we specifically choose 



Ok\i]=Z{\W- 1 f k (Wz k - 1 )[i\\+r l ) \ 



where W^ 1 f k (Wzk-i) [*] is the fi 1 coefficient of the previous signal propagated through the dynamics, rj is a 
linear offset and £ is a multiplicative constant. Note that any general model for the dynamics is allowable and we 
are not restricting the system to linear dynamical systems. Note also that the absolute values are necessary since A 
determines a variance, which must be strictly positive. The parameter r/ determines the distribution of the variance 
when the coefficient is predicted to be zero, resulting in W~ x f k (Wz*k-l) [i] = 0. This parameter reflects the 
magnitude of the innovations in the erroneous model predictions. The joint MAP estimate of all model parameters 
becomes 



[z k ,X k ] = argmin||y fc - G k Wz\ \\ + (3 ^ | X[i]z[i 



[z,\] 

-a log (AH) + y^A[i] (iWVfcCWzfc-i) m + (13) 

i i 

which is identical to the MAP estimate in the LSM except for the appearance of a and the time-dependencies on 
the parameters. 



As in the LSM case, the optimization in( 13) is not easily solved jointly for both z k and X k , but the model yields 
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a simple form when using an EM approach. The precise steps of the iteration in this case are 



E step: X{ = E p(x \^ k) [A] 



M step: z k = arg min — log 



p 



(**|a) 



(14) 
(15) 



where t denotes the EM iteration number and E p (\\z k ) ['] denotes the expectation with respect to the conditional 
distribution p(X\z k ). We can write the maximization step as 



z\ = argnun||yjfc - GWz\\ 2 2 + A ^ | A* [i]z[i] 

i 



(16) 



since the MAP optimization conditioned on the A parameters reduces to a weighted t\ optimization. 

While the expectation step is often difficult to calculate, this model admits a simple closed-form solution. First 
we can use the conjugacy of the Gamma and Laplacian distributions to calculate the conditional distribution 

p(**|A)p(A) 



p(A|z fe ) 



P(Zk) 



which is separable in A. We can analytically write this distribution by evaluating 

a9Xo 



P (**[*]) = 



2(ex \z k [i]\ + i) a+1 

a (r'lw-Vk (wz fc _i) [»]| + r^P 



2 (A \z k \i]\ + £- 1| W-Vfe (n-i) [»]| + r^P 1 ' 
and the expectation can be calculated as 

(a + 1)9 



(17) 



[A] = 



0A o |z fe [z]| + l 



(a + l)C 



(18) 



^\z k [{\\ + \W~ 1 f k {Wz k ^)[^\+r ] - 

Putting the pieces of the EM algorithm above together results in an iterative re-weighted l\ dynamic filter 
(RWL1-DF): 



z k 



arg mm 

X 



ai^w = 



\y k - G k x\\l + A ^ |Afe[z]x[z 

i 

2t 



(19) 
(20) 



P\z\[i]\ + \W^h{Wz k ^)[i\\+r 1 ' 

where r = (a + 1) ^ is a constant scaling value, (3 = Ao£ can be interpreted as a tradeoff between the measurement 
and the prediction, and the signal of interest can again be recovered via x k = Wz k . The resulting optimization 
procedure looks nearly identical to the static RWL1 algorithm except that the denominator in the A update contains a 
term depending on the previous state. This term encourages smaller A values (i.e., higher variances) in the elements 
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that are predicted to be highly active according to (x^i). This graduated encouragement of coefficients selected 
by the prediction (rather than direct penalization) allows the algorithm to perform especially well when the states and 
innovations are sparse while retaining good performance when the innovations are denser. Furthermore, the simple 
form means that the explosion of recent work in t\ optimization methods can be leveraged for computationally 
efficient recursive updates. In particular, since no covariance matrix inversion is required and many modern t\ 
estimation methods require only matrix multiplication (and no inversion), this approach is also amenable to high- 
dimensional data analysis. 

IV. Results 

While the RWL1-DF inference scheme is a general inference tool with many potential applications, we focus 
our evaluation purposes on CS recovery. Compressive sensing offers systematic ways to specify inference problems 
(ratios of M to N with respect to the signal sparsity) that are very challenging for static inference techniques and 
adequate estimates require exploiting dynamic information. In the examples below we use both a stylized scenario 
with synthetic data and a an example on a natural video sequence. 

In all simulations, we compare the performance of RWL1-DF to the performance of independent BPDN (BPDN 
applied independently at each time step with no temporal information) and independent RWL1 (RWL1 applied 
independently at each time step with no temporal information) to highlight the benefit of including dynamic 



information. We also compare to the BPDN-DF (BPDN modified, as in Equation pO) [ 34 1, 1 35 1, to include additional 
norms which match the signal statistics: a l\ norm on the prediction error for the simulated data and a £2 norm on 
the prediction error for the video sequence) due to its similarity in implementation and philosophy to the RWL1-DF. 
Standard Kalman filtering is not shown here because it performs very poorly in these type of simulations (i.e., 



it doesn't converge to a stable estimate with these sparse statistics) [35 1. Furthermore, we do not compare to 



other methods highlighted in Section II-C (or other Kalman filter extensions) due to their incompatibility with the 
example scenarios presented here. These algorithms may not support arbitrary dynamic models (i.e., they have an 
incompatible signal models such as implicit assumptions that the system dynamics are the identity function) and 
require prohibitive computations such as inverting full covariance matrices for very high-dimensional data. 

A. Synthetic Data 

To explore the performance and robustness of RWL1-DF in detail, we first simulate inference on synthetic data 
where there is "ground truth" available and we can make controlled variations to the data characteristics. In this 
data, we generate dynamically changing hidden states by applying a different random permutation and scaling 
operator at each time step (i.e., is a known but different random permutation matrix at each k). While linear 
dynamics models are not required in the model assumptions, this simple test captures the model notions of sparse 
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state elements that have changing support locations and values with time. The simulated dynamics also include a 
sparse innovations term (i.e., dynamic model error) by mapping a small number (p) of coefficients to an incorrect 
permutation. This process simulates an innovation that is 2p-sparse (one "missing" and one "extra" coefficient per 
incorrect permutation), which allows us to evaluate the algorithm's robustness to the type of shot noise that is 
particularly challenging for Kalman filter techniques. 

With these simulated dynamics, we create lengfh-500 (N = 500) sequences that are 20-sparse (S = 20). The 
vectors are observed with Gaussian measurement matrices (with normalized columns) that are independently drawn 
at each iteration, and we add Gaussian measurement noise with variance o\ = 0.001. We vary the number of 
measurements to observe the reconstruction capability of the algorithm in highly undersampled regimes, but the 
number of measurements is always constant within a trial. All simulations average the results of 40 independent 
runs and display reconstruction results as the relative mean-squared error (rMSE): 



For independent BPDN, at each iteration we use the value A = 0.55o"g. For independent RWL1 we use Ao = 0.0011, 
r = 1 and v = 0.01. For BPDN-DF we use 7 = 0.5o^ and k = 0.001/(p+ 1). For RWL1-DF we use A = 0.0011, 
r = 1 and v = 1 — 2p/S. These parameters were optimized using a parameter sweep. 



The estimation for BPDN-DF and RWL1-DF improves with time and converges to steady state values, indicating 
that both approaches are exploiting useful dynamic information to sequentially improve over time (in contrast to 
the methods that do not incorporate dynamic information, as expected). Note that RWL1-DF reaches substantially 
lower steady-state recovery error than BPDN-DF, illustrating the net improvements gained by using second-order 
statistics in the estimation. 

To explore the performance of RWL1-DF, Figure |3jb) displays the results of varying the number of measurements 
while holding p = 3. While performance for all algorithms is comparable for large numbers of measurements, it is 
clear that exploiting temporal information can improve performance in the undersampling regime. In particular, 
RWL1-DF is able to sustain virtually the same steady-state rMSE down to much more aggressive levels of 
undersampling than even BPDN-DF. 

Finally, we explore the robustness of each algorithm to model errors by fixing the number of measurements 
(M = 70) and varying the sparsity of the innovations p. Figure |3jc) shows the results, illustrating that RWL1-DF 
uses the second-order statistics to sustain better performance than BPDN-DF when the innovations are sparse (i.e., 
shot noise). We note that when p > 4 and BPDN-DF achieves lower rMSE, the total number of model errors is 




(21) 



Figure |3|a) shows a single trial with M = 70 measurements and 2p = 6 innovation errors at each time step. 
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Fig. 3. Behavior of the RWL1-DF algorithm on synthetic data, (a) RWL1-DF converges over time to a lower mean rMSE than static sparse 
estimation or BPDN-DF. Shown for M = 70, N = 500, S = 20 and p = 3. (b) When sweeping the number of measurements M for N = 
500, S = 20, and p = 3, we observe that RWL1-DF retains lower rMSE values for low numbers of per-iteration measurements. Each point is 
the average steady state rMSE over 40 independent trials, (c) RWL1-DF is also more robust to model mismatch in the innovation statistics. 
Shown here for different innovations sparsity p for M = 70, N = 500, and S = 20. Each data point is the result of averaging the steady-state 
rMSE over 40 independent trials. Note that when BPDN-DF starts to perform better (p — 5), the innovations are actually half of the total 
support set and a Gaussian innovations model may be more accurate than a sparse innovation model. 

50% of the signal sparsity and may be better approximated by a dense (i.e., non-sparse) innovations model. We note 
that due to the method of simulating the sparse innovations, the total energy slightly increases when p is higher, 
accounting for slightly degraded performance for even the independent cases. Despite even this model mismatch, 
RWL1-DF does not degrade nearly as fast with respect to increasing p. 



B. Natural Video Sequences 

To test the utility of RWL1-DF on natural signals, we explore its performance on a portion (128 x 128 pixels) of 
the standard Foreman video sequence]^] that has been compressively sampled. In this case, we take the time-varying 
hidden state x n to be the wavelet (synthesis) coefficients at each frame of the video. While the true frame-by-frame 
dynamics of natural video are likely to be complex and non-linear, for this simulation we use a simple first-order 
model predicting that the coefficients will remain the same from one frame to the next: f x (k) = x for all k. We note 
that RWL1-DF is not specialized to this simple model, and any advanced efforts to build a more accurate model of the 
state dynamics in natural video could be easily incorporated. To illustrate the effects of the representation (especially 
in this simple dynamics model), we run two separate simulations using both the orthonormal Daubechies wavelet 



transform (DWT) and a four-times overcomplete dual-tree discrete wavelet transform (DT-DWT) |39|. Compared 
to the DWT, we expect the redundancy in DT-DWT to produce higher levels of sparsity, which will improve CS 
recovery overall. Furthermore, we also expect the DT-DWT to be more shift-invariant, leading to better performance 
of the simple dynamics model that assumes stationary coefficients from frame to frame. 

We simulate CS measurements by applying a subsampled noiselet transforrrj^] [40[ to each frame, taking M = 

6 The Foreman sequence is available at: http://www.hlevkin.com/TestVideo/foreman.yuv 

7 We use the noiselet transform because it can be computed with an efficient implicit transform and has enough similarity to a random 
measurement that it works well in CS. 
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0.27N measurements per frame where N = 128 2 . In the DWT recovery we use the following parameters: A = 0.01 
for BPDN, A = 0.001, r = 0.05 and v = 0.1 for RWL1, 7 = 0.01 and k = 0.4 for BPDN-DF, and A = 0.001, r = 
0.2 0, = 1 and v = 0.2 for RWL1-DF. In the DT-DWT recovery we use the parameters A = 0.001 for BPDN, A 
= 0.11, r = 0.2 and v = 0.1 for RWL1, 7 = 0.01 and k = 0.2 for BPDN-DF, and A = 0.003, r = 4, /3 = 1 and v 
= 1 for RWL1-DF. Again, we found these parameter setting through a parameter sweep. We solve all optimization 



programs using the TFOCS package |41| due its superior stability during RWL1 optimization and the ability to 
use implicit operators for matrix multiplications. 

Figure [4] shows the recovery of 200 consecutive frames of video. As expected, we see that all algorithms perform 
better in the DT-DWT case than the DWT case due to the increased sparsity of the representation (DT-DWT 
representations used approximately 62% of the coefficients necessary in the DWT). In both cases, RWL1-DF 
converges to the lowest steady-state rMSE and is able to largely sustain that performance over the sequence. In 
contrast, BPDN-DF cycles through periods of good performance and poor performance, sometimes performing 
worse than not using temporal information at all. In essence, BPDN-DF is not robust to model errors, and each 
time there is motion in the scene (violating the simple dynamics model) the algorithm has to re-converge. The 
RWL1-DF does not exhibit this performance oscillation because the use of second-order statistics to propagate 
temporal information is less rigid, allowing for more robustness when the model is in error. As expected, while 
BPDN-DF is still susceptible to model errors, the fragility in BPDN-DF is somewhat mitigated when using the 
DT-DWT because the simple model is more accurate in this case. 

To summarize the performance over the entire video sequence, Figure [5] shows histogram plots of the rMSE 
values over the 200 frames for each recovery process. The mean and median for each histogram are represented by 
the green dashed line and the red arrow respectively. As expected, the algorithms with no temporal information are 
immune to errors in the dynamic model (since it is not used), reflected in the fact that the mean and median are 
virtually the same in each of these cases. However, with BPDN-DF the errors are much more spread out, achieving 
nearly the same median error as the independent algorithms and having a much higher mean error due to the large 
excursions. In other words, unless a more accurate (and complex) dynamic model can be used, BPDN-DF actually 
may be a worse choice than not using any temporal information at all. In contrast, RWL1-DF shows a much tighter 
distribution of errors, having a mean and median significantly lower than the other approaches. Additionally the 
statistics for the RWL1-DF do not change much with the loss of redundancy (and therefore the accompanying 
temporal regularity), indicating that similar performance can be achieved with smaller state spaces. 
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Fig. 4. CS recovery of the full Foreman video sequence shows that RWL1-DF is more robust to changes in the video sequence. Each 
curve represents the rMSE for recovery from subsampled noiselets (M/N = 0.27) using either a DWT (top) or a DT-DWT (bottom) as the 
sparsifying basis. The independent BPDN recovery (dotted blue curve) and the independent re-weighted BPDN (solid green curve) retain a 
steady rMSE over time. RWL1-DF (dashed cyan curve) converges on a lower rMSE than either time-independent estimation and remains at 
approximately steady-state for the remainder of the video sequence. BPDN-DF (the dot-dash red curve) can converge to low rMSE values, but 
is highly unstable and can yield very poor results when the model is not regular. While moving from the four-times overcomplete DT-DWT 
to the DWT, all algorithms except RWL1-DF (but especially BPDN-DF) suffer in performance due to the dynamic signal model being more 
inaccurate. 



V. CONCLUSIONS 

In this work we propose a causal dynamic inference scheme for sparse models that reflects the spirit of Kalman 
filtering by propagating second-order statistics in a computationally efficient recursive update. This approach is 
based on hierarchical probabilistic models, and attempts to integrate the temporal information leveraged in dynamic 
filtering with the sparse signal structure that has been so successful in modern signal processing. We demonstrate 
that this algorithm achieves superior reconstruction rates and more robustness to sparse model mismatch (i.e., shot 
noise) than other comparable schemes. 

In addition to improving recovery performance, RWL1-DF has the further advantage of being computationally 
efficient. There is no need to calculate or store full covariance matrices as required in Kalman filtering-based 
approaches, making it more appropriate for high-dimensional data such as large video sequences. Instead, RWL1- 
DF only requires storage of a set of variances that is the same size as the number of variables (instead of the square 
of this dimension as required by a covariance matrix). The core computations of this approach center around RWL1 
optimization, meaning that it can take advantage of modern software packages for efficient l\ minimization (e.g. 



TFOCS or NESTA |41|, [42 1) that often exploit implicit operators (instead of explicit matrices). Though multiple 
l\ minimization problems must be solved for the EM algorithm at each time step, our experience was that no 
more than 10 (and usually no more than 5) iterations of the EM algorithm must be performed at each time step 
to converge. Recent applications of homotopy algorithms could further improve the speed of performing RWL1 
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Fig. 5. Histogram of the achieved errors for both the DWT and the DT-DWT as the sparsifying basis. RWL1-DF not only achieves a 
lower mean (indicated by the dashed green lines) and median (indicated by the red arrows), but the overall distribution is concentrated at 
lower rMSE values for both cases. On the other hand, BPDN-DF achieves a median approximately the same as (independent) RWL1, but 
the distribution of errors is spread out due to the lack of robustness in the estimation, resulting in a higher mean. 



via low rank updates [43]. In addition to the digital efficiency, the RWL1-DF can benefit from recent advances 
in analog implementations of optimization programs. Analog designs and small-scale implementations have been 



shown for solving BPDN in a fraction of the time of digital solvers [17| and a similar continuous time system has 
been established to solve RWL1 p4[ . Since RWL1-DF essentially solves a RWL1 optimization at each step, these 
systems can potentially be used to implement RWL1-DF in analog hardware to enable real-time inference. 

The sum total of these results lead us to conclude that hierarchical sparsity models such as the LSM may be a 
very general framework to successfully incorporate higher-order signal information, including (but not limited to) 
temporal regularity. As with any inference algorithm, performance will improve with a model that captures more 
structure in the data. For applications in CS, significant future work will center around developing more accurate 
dynamic models to capture more structure in the temporal regularity. Fortunately, the RWL1-DF approach is not 
limited to linear dynamics models, and more complex nonlinear models of signal structure can be incorporated 
with straightforward modifications to the examples presented here. 
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